DDPM: A Dengue Disease Prediction and Diagnosis Model Using Sentiment Analysis and Machine Learning Algorithms

Dengue fever is an arboviral disease caused by dengue viruses (DENVs), which are transmitted by Aedes mosquitoes. In 2019, the World Health Organization (WHO) estimated a yearly occurrence of 100 million to 400 million infections, the highest number of dengue cases ever reported worldwide, prompting WHO to label the virus one of the world’s top ten public health risks. Dengue hemorrhagic fever can progress into dengue shock syndrome, which can be fatal. To provide accessible and timely supportive care and therapy, practical instruments are needed that accurately differentiate dengue and its subcategories in the early stages of illness development. Predicting dengue fever in advance can save lives by warning people to seek proper diagnosis and treatment. Predicting infectious diseases such as dengue is difficult, and most forecast systems are still in their early stages. Data from microarrays and RNA-Seq have been used extensively in developing dengue predictive models. Bayesian inference and support vector machine algorithms are two examples of statistical methods that can mine opinions and analyze sentiment from text. In general, these methods are not very strong semantically, and they only work effectively when the text inputs are at the level of the page or the paragraph; they are poor miners of sentiment at the level of the sentence or the phrase. In this research, we propose a machine learning method to forecast dengue fever.


Introduction
The Aedes aegypti mosquito transmits the dengue virus (DENV), the causative agent of dengue fever, from person to person. There is currently no vaccine that protects against all serotypes of the virus. As a direct consequence, reducing the number of mosquitoes in an area has become the primary focus of the fight against the disease. Researchers are using machine learning (ML) and deep learning (DL) to forecast dengue cases and assist governments in their fight against the disease [1].
Dengue virus is a member of the genus Flavivirus in the family Flaviviridae [2,3]. Arthropods are the primary vectors for the spread of the dengue virus. The virus comprises four serotypes, referred to as DEN 1, DEN 2, DEN 3, and DEN 4. According to the World Health Organization (WHO), dengue fever poses a considerable [...] radiologists is dangerous; instead, it helps physicians provide more accurate diagnoses to their patients. Deep learning is an advanced subfield of machine learning that is widely applied in computer vision. The primary aim of computer vision is to carry out a variety of tasks, including image detection and recognition, image analysis, natural language processing, and other similar activities. Over the past few years, interest in computer vision has grown substantially across a variety of academic domains. The convolutional neural network (CNN) is a type of artificial neural network that was developed specifically for processing image and video data, and it is used in most computer vision tasks, particularly those involving the classification, recognition, and segmentation of medical images. A CNN takes images as input, extracts and learns features from them, and then classifies the images according to the features it has learned. Several CNN-based models have been put forward, including AlexNet, SPP-Net, VGGNet, ResNet, GoogLeNet, and others. Deep CNN-based algorithms have shown promising results in the processing of medical images. This work includes an introduction to CNNs in medical imaging analysis as well as a general discussion of machine learning and deep learning applied to medical images. The researchers investigate several different machine learning methods, of which generalized additive modeling is just one.
The contribution of work: The purpose of this paper is to pursue an early diagnostic model that helps doctors in the prompt prognosis and diagnosis of dengue disease by using machine learning algorithms. The key steps are as follows:
1. Using techniques from the field of machine learning, such as the KNN classifier, decision tree, random forest, Gaussian naive Bayes, and support vector classifier (SVC), among others.
2. Creating a diagnostic model based on machine learning for fast detection and prognosis of dengue disease to aid medical professionals in making decisions.
3. The K-Fold method is used here for the purpose of result validation.

Related Work
Machine learning (ML) is the field of computing that enables computers to learn from data without being explicitly programmed [27,28]. The study of ML falls under the umbrella of computer science. ML has become pervasive and essential for resolving intricate problems across the sciences, but especially in the field of illness diagnostics [29,30]. As a direct effect of ongoing technological improvement, machine learning algorithms and techniques will soon be able to foresee and differentiate between a wide variety of illnesses in the healthcare field [30,31,32]. Machine learning is often cited as one of the most productive research approaches, particularly for predicting disease occurrence. There are several distinct kinds of ML algorithms, each of which can be applied to disease forecasting [33,34]. Table 1 summarizes the findings of an investigation into several machine learning approaches, along with the pertinent research. According to the review that was carried out [35,36], several distinct machine learning methods, including SVM, KNN, R.F., D.T., and SVC, are utilized and evaluated for the purpose of dengue prediction.

Materials and Methods
Within the scope of this publication, we constructed a diagnostic and prognostic model for dengue fever. We broke the task down into steps, beginning with the first phase of data collection, then moving on to data preprocessing, and finally employing ML classifiers to evaluate the output according to the accuracy (mean) of disease prediction (see Figure 1).
Diagnostics 2023, 13, x FOR PEER REVIEW 6 of 16

Figure 1. The workflow for the implementation of the proposed diagnostic model for the diagnosis of dengue disease.

Data Collection
The objective is to accurately predict the total number of dengue cases for each city, year, and week of the year in the test set. This study uses open data from the DengAI competition (DengAI: Predicting Disease Spread, drivendata.org). The DengAI competition comprises data for two cities, San Juan and Iquitos, extending from three to five years. Every piece of information contributes its own set of forecasts for these cities. The data are separated into two categories, the training and test datasets, as shown in Table 2.

Data Preprocessing
Data preprocessing is the most significant component of the machine learning pipeline. It converts unprocessed data into processed (meaningful) data. The dataset needs to be cleaned, normalized, and freed of noise before it can be used for analysis (see Figure 2).

Features Selection
Feature selection is one of the most critical steps in building a prediction model. During this phase, the number of input variables is narrowed down to reduce the amount of computing required for the modeling process and, in some cases, to improve the overall performance of the model. The dataset has missing values for some of its attributes, so we replace them with the mean of the corresponding attribute. After that, we use the fit and transform method to normalize and standardize the data.
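The two cleaning steps described above, mean imputation followed by standardization, can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' code; the sample readings are hypothetical. In scikit-learn, the corresponding tools would presumably be `SimpleImputer(strategy="mean")` and `StandardScaler().fit_transform(...)`, which is an assumption about what "the fit and transform method" refers to.

```python
from statistics import fmean, pstdev


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    mean = fmean(observed)
    return [mean if value is None else value for value in column]


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical weekly readings with two missing values (not taken from DengAI).
precipitation_amt_mm = [12.0, None, 30.0, 18.0, None]

filled = impute_mean(precipitation_amt_mm)
scaled = standardize(filled)
```

Here the observed values average to 20.0, so both gaps are filled with 20.0 before scaling. In practice the mean and standard deviation would be estimated on the training data only and then reused for the test data.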

Figure 3 shows that several features have extreme values. After investigating the data, it became clear that they are neither outliers nor errors; hence, we cannot disregard them and must take them into consideration. For example, the precipitation values are estimates of the amount of rain, and it is reasonable to expect the weather to vary significantly depending on the location. The features reanalysis_avg_temp_k and reanalysis_specific_humidity_g_per_kg appear to be quite similar in shape, which raises the question of whether they are correlated with one another.
Figure 4 shows that certain features are perfectly correlated with one another (1), while other features are almost perfectly correlated (0.9). The same information is presented in Table 3.
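As an illustration of how such a correlation screen might be computed, the sketch below calculates the Pearson coefficient for every pair of features and keeps the pairs at or above 0.9. The feature names come from the text, but the values are hypothetical and the 0.9 threshold is simply the cut-off quoted above; this is not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs


# Hypothetical values; the first two columns are identical on purpose.
features = {
    "precipitation_amt_mm": [0.0, 12.4, 22.8, 34.5, 15.4],
    "reanalysis_sat_precip_amt_mm": [0.0, 12.4, 22.8, 34.5, 15.4],
    "ndvi_nw": [0.10, 0.25, 0.15, 0.30, 0.05],
}

pairs = highly_correlated_pairs(features)
```

One feature from each reported pair can then be dropped, since a perfectly correlated column adds no new information to the model.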

Correlation (1/0.9)
reanalysis_sat_precip_amt_mm and precipitation_amt_mm: 1
reanalysis_specific_humidity_g_per_kg and reanalysis_dew_point_temp_k: 1
ndvi_nw: 0.9
ndvi_ne: 0.9
reanalysis_avg_temp_k: 0.9
reanalysis_air_temp_k: 0.9
reanalysis_tdtr_k: 0.9
reanalysis_max_air_temp_k: 0.9
station_diur_temp_rng_c: 0.9
reanalysis_tdtr_k: 0.9

Because we want to detect dengue, a feature whose values are far apart in the two cities (for example, reanalysis_tdtr_k) is suitable for ML classification; a feature whose values are close together and give mixed information is not considered suitable for prediction/classification. From Figures 5 and 6, and from the dataset, we can create a new data frame, i.e., X_train plus the total_cases column of y_train.
After applying all the above-mentioned steps, we deduce the features from the dataset, as shown in Table 4. In this feature selection, we drop two features, i.e., reanalysis_sat_precip_amt_mm and reanalysis_specific_humidity_g_per_kg. We do not consider them because they are almost perfectly correlated with other features (0.9), and we want to achieve good accuracy. However, if we apply machine learning algorithms, i.e., KNN, D.T., R.F., and GNB, at this stage, the accuracy is very low. This is because the total number of cases varies immensely, from 0 to 400+. The question arises, "How can we improve this accuracy?" The answer is to divide our dataset by city, as shown in Table 4.
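The data-frame construction and the per-city split described above could look roughly like the following sketch. It assumes that the feature rows and label rows are aligned by position and that cities are identified by a `city` field; the codes `"sj"` and `"iq"` and all values are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict


def attach_labels(X_train, y_train):
    """Return the feature rows extended with the matching total_cases label."""
    if len(X_train) != len(y_train):
        raise ValueError("X_train and y_train must have the same number of rows")
    return [
        {**features, "total_cases": label["total_cases"]}
        for features, label in zip(X_train, y_train)
    ]


def split_by_city(rows):
    """Group rows by their 'city' field so each city can be modeled separately."""
    by_city = defaultdict(list)
    for row in rows:
        by_city[row["city"]].append(row)
    return dict(by_city)


# Hypothetical rows; only a few columns are shown.
X_train = [
    {"city": "sj", "year": 1990, "weekofyear": 18, "reanalysis_tdtr_k": 2.6},
    {"city": "sj", "year": 1990, "weekofyear": 19, "reanalysis_tdtr_k": 2.4},
    {"city": "iq", "year": 2000, "weekofyear": 26, "reanalysis_tdtr_k": 8.9},
]
y_train = [{"total_cases": 4}, {"total_cases": 5}, {"total_cases": 0}]

train = attach_labels(X_train, y_train)
per_city = split_by_city(train)
```

Splitting first and fitting one model per city keeps the very different case ranges of the two cities from being mixed in a single model.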
After this, we find the correlations for the two cities, i.e., San Juan and Iquitos, separately, as shown in Figure 7. From the per-city correlations, we can deduce some information, such as that both cities showed promising results for reanalysis_specific_humidity_g_per_kg, reanalysis_dew_point_temp_k, and reanalysis_min_air_temp_k. These features are perfectly correlated with each other (value 1). This suggests that mosquitoes live in areas with high humidity. Since temperature plays a vital role in the spread of mosquitoes, the temperature features are correlated both with each other and with the total number of cases. Surprisingly, the week of the year is also highly correlated with the number of cases in San Juan, and as a result, we will be keeping a close eye on it. In addition, if we plot the number of cases in each year against the week of the year, we find that there is an outbreak at the end of the year in both cities. The number of reported cases grows, and outbreaks often occur over a few weeks, as illustrated in Table 5 and Figures 8 and 9.
Table 5. Increase in outbreaks and cases in two cities (San Juan/Iquitos) in weeks.

            Increase in Cases (Range in Weeks)    Increase in Outbreak (Range in Weeks)
San Juan    35th-45th                             35th-45th
Iquitos     45th-50th                             45th-50th
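A seasonal pattern of this kind can be checked by averaging the case counts per week of the year across all years and looking for the peak. The sketch below shows one way to do so; the rows, the city codes, and the field names `weekofyear` and `total_cases` are assumptions for illustration, and the numbers are invented rather than taken from the study.

```python
from collections import defaultdict
from statistics import fmean


def mean_cases_by_week(rows):
    """Average total_cases per (city, weekofyear) across all years."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["city"], row["weekofyear"])].append(row["total_cases"])
    return {key: fmean(values) for key, values in buckets.items()}


def peak_week(weekly_means, city):
    """Return the week of the year with the highest mean case count for a city."""
    candidates = {
        week: mean for (name, week), mean in weekly_means.items() if name == city
    }
    return max(candidates, key=candidates.get)


# Hypothetical observations for two weeks in each city over two years.
rows = [
    {"city": "sj", "year": 2000, "weekofyear": 10, "total_cases": 5},
    {"city": "sj", "year": 2000, "weekofyear": 40, "total_cases": 60},
    {"city": "sj", "year": 2001, "weekofyear": 10, "total_cases": 7},
    {"city": "sj", "year": 2001, "weekofyear": 40, "total_cases": 80},
    {"city": "iq", "year": 2001, "weekofyear": 10, "total_cases": 2},
    {"city": "iq", "year": 2001, "weekofyear": 47, "total_cases": 20},
    {"city": "iq", "year": 2002, "weekofyear": 10, "total_cases": 4},
    {"city": "iq", "year": 2002, "weekofyear": 47, "total_cases": 30},
]

weekly_means = mean_cases_by_week(rows)
sj_peak = peak_week(weekly_means, "sj")
iq_peak = peak_week(weekly_means, "iq")
```

With these invented numbers the peaks fall in week 40 for the first city and week 47 for the second, which mimics the late-year ranges listed in Table 5.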

Results of Different Classifiers
In this instance, we use a variety of machine learning classifiers, beginning with KNN and moving on to decision trees, random forests, Gaussian naive Bayes, and support vector classifiers. We use k-fold cross-validation to partition the data into ten equal portions for the purpose of classification. The mean value obtained after ten iterations is shown in Table 6.
Table 6. Analysis result of different machine learning classifiers to classify dengue disease by using 10-fold cross-validation and mean as a result of 10 iterations.
Table 6 shows plainly that the random forest classifier turns out to be the best one, with a mean score of 8.72. For a more transparent illustration of the ranking of classifiers, see Figure 10.
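The validation scheme described above can be sketched as follows: the data are cut into ten folds, each fold serves once as the test set, and the ten scores are averaged. The example is a simplified stand-in rather than the authors' pipeline. It uses a small hand-written KNN classifier, contiguous folds without shuffling, and synthetic two-class data, all of which are assumptions; scikit-learn offers `KFold` and `cross_val_score` for the same purpose.

```python
from statistics import fmean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def knn_predict(train_X, train_y, point, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], point)),
    )[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def cross_val_accuracy(X, y, n_folds=10):
    """Return the accuracy obtained on each of the n_folds held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i]) == y[i] for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return scores


# Synthetic, well-separated two-class data (hypothetical, 20 samples).
X = [[(i % 2) * 10 + (i % 5) * 0.1, (i % 2) * 10 + (i % 3) * 0.1] for i in range(20)]
y = [i % 2 for i in range(20)]

scores = cross_val_accuracy(X, y, n_folds=10)
mean_score = fmean(scores)
```

Running the same loop with each classifier in turn and comparing the resulting means is the kind of comparison that Table 6 reports.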

Discussion and Conclusions
Owing to its popularity and widespread application in image segmentation, deep learning has developed into a crucial instrument and is able to achieve ever-higher levels of precision. However, the primary concern is the optimization of deep learning, which encompasses multiple levels. These levels include perfecting the deep network architectures and carrying out ensemble learning; hyperparameter tuning, which is an empirical method; optimizing the loss function in accordance with evaluation metrics; and making use of the appropriate optimizer and activation functions.
The purpose of this research is to develop a diagnostic model for dengue by using machine learning techniques such as KNN, D.T., R.F., SVC, and GNB. The model is intended to predict the progression of the disease as well as allow for its early diagnosis. In future work, priority should be placed on cause-effect models for the diagnosis of disease. Not only is it vital to diagnose the sickness, but it is also essential to analyze the elements that have the most considerable influence on the infection. A more profound comprehension of the etiology of the disease, along with the creation of more accurate diagnostic models, would be of tremendous assistance in the fight against dengue fever, as well as in the reduction of complications and fatalities caused by the disease. The use of modeling to minimize the impact of data uncertainty is another vital area. One of the primary challenges that must be surmounted before the quality of previously developed models can be enhanced is the poor standard of epidemiological data about dengue. As a last consideration, the use of independent loops of data analysis works to automate the decision-making process in disease control. Although the R.F. method generates better results than the D.T., KNN, SVC, and GNB methods, it requires significantly more time to compute. Based on the findings, the R.F. technique appears to be the one to choose. We therefore conclude that, out of all these machine learning algorithms, the RF-based diagnostic model is the one best suited for accurately diagnosing dengue fever at an early stage.
The substantial number of optimization factors and schemes that needed to be conducted empirically in order to give our final design requirements were the primary obstacles that needed to be overcome in this effort. Even if we have scaled back the trainable parameters of the network such that they are more compatible with the hardware, there is still the issue of the significant amount of CPU power that must be present to complete the training.
In conclusion, reason-based models can help with the analysis and interpretation of dengue disease data. Because there is a severe lack of high-quality data in the field of healthcare, machine learning models that can deal with ambiguity can be highly valuable. Finally, data decentralization, in conjunction with aggregated learning, may make it possible to cut the costs of computer modeling without compromising the data's integrity.